System and method for format-agnostic document ingestion

ABSTRACT

A system for format-agnostic document ingestion including a document ingestion server and a database is disclosed. The server is configured to receive an image of a document comprising text in an unknown format, and to convert the image, using OCR, into a plurality of text elements, each having a content, a size, and an absolute position. The server is also configured to retrieve data detectors from the database, each associated with a data type anticipated to be in the document, and each comprising at least one identifier, at least one direction, and at least one validation criteria. The server is also configured to identify a potential descriptor by comparing the content of each text element with the at least one identifier, and then determine if the text element pointed to by the data detector meets the validation criteria. Finally, the server is configured to associate the validated text element with the data detector, and store the content.

RELATED APPLICATIONS

This application is a continuation of U.S. Pat. Application No. 16/775,051, filed Jan. 28, 2020, which claims the benefit of U.S. Provisional Pat. Application No. 62/797,710, filed Jan. 28, 2019, titled “System and Method for Medical Bill Management,” the contents of each of which are incorporated by reference in their entireties.

TECHNICAL FIELD

Aspects of this document relate generally to format-agnostic document ingestion.

BACKGROUND

Although there is increasing interest in moving various workflows and processes to be entirely electronic, a great deal of interactions between companies and individuals involves human-readable documents, either “analog” (e.g. paper documents) or digital (e.g. PDF documents). Even among documents of the same type, such as shipping manifests for freight delivery services, each party will execute that task in a slightly different manner, resulting in everyone having a different document format.

This has been a long-standing problem, and is one that complicates efforts to ingest a large number of documents, whether to digitize old physical records or to gather documents from various sources. While these tasks may be accomplished manually on a small scale, those methods do not scale well, and quickly get bogged down when trying to coordinate multiple document formats.

Conventional solutions to this problem have relied on systems that understand the various document formats. However, these conventional systems require training, and are fragile. A small change in document format can throw such systems into chaos. While slightly more scalable than manual effort, conventional solutions require a level of preparation and upkeep that often negates most of the efficiency of the automation.

SUMMARY

According to one aspect, a system for format-agnostic document ingestion includes a document ingestion server having a processor and a memory, the document ingestion server communicatively coupled to a database, and the processor configured to receive an image of a document, the document including text arranged in an unknown format. The processor is also configured to convert, using optical character recognition, the image of the document into a plurality of text elements, each text element including a content, a size, and an absolute position within the document. The processor is also configured to identify a document type by searching the content of each text element for a plurality of distinguishing strings, each distinguishing string being unique to one document type, as well as retrieve a plurality of data detectors from the database based on the document type, each data detector associated with a data type that is anticipated to be in the document. Each data detector includes at least one identifier that is one of a potential label and a potential format, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria. Each validation criteria describes one of a valid format and a valid range. The processor is configured to determine a source of the document by comparing the at least one identifier of a data detector associated with a data type that is unique among potential document sources with the content of each text element of the plurality of text elements.
The processor is configured to, for each data detector, order at least one of the identifiers and the directions according to a history stored in the database and associated with the source, and identify a table within the document by calculating for each text element of the plurality of text elements a relative position of at least one neighboring text element relative to the text element using the absolute position of the text element, and comparing the relative positions of the plurality of text elements. Furthermore, the processor is configured to locate a header for the table by comparing the content of the text elements within the table with the identifiers of the plurality of data detectors and then identifying the data type of the matching text elements, the header being one of a row and a column, and validate, for each identified text element in the header, at least one text element within the other of a row and a column described by the identified text element in the header with the validation criteria of the data detector that identified the identified text element in the header. The processor is configured to also associate, for each identified text element in the header, at least one validated text element within the other of the row and the column described by the identified text element in the header with the data detector that identified the identified text element in the header, and identify a potential descriptor by comparing the content of each text element not part of the table with the at least one identifier of at least one data detector. The processor is also configured to determine if the text element pointed to by one of the at least one direction of the data detector used to identify the potential descriptor meets the validation criteria of the data detector, associate the validated text element with the data detector, and store, for each text element associated with one data detector of the plurality of data detectors, the content of the text element, in the database.
Lastly, the processor is configured to update, for each data detector, the history associated with the source, according to which identifier of the at least one identifier and which direction of the at least one direction matched the most text elements of the data type described by the data detector in the document.
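By way of illustration only, the history-based ordering and updating described above might be sketched as follows. The function names, the use of a plain dictionary of match counts, and the sample labels are assumptions made for this example and are not taken from the disclosure.

```python
from collections import Counter


def order_by_history(candidates, history):
    """Sort identifiers (or directions) so those with more past matches come first.

    `history` is a hypothetical per-source mapping from candidate to match count.
    The sort is stable, so candidates with equal counts keep their original order.
    """
    return sorted(candidates, key=lambda c: -history.get(c, 0))


def update_history(history, matches):
    """Credit the candidate that matched the most text elements in one document."""
    if matches:
        best, _ = Counter(matches).most_common(1)[0]
        history[best] = history.get(best, 0) + 1
    return history


# Hypothetical labels for an "amount due" data type and a history for one source.
labels = ["Amount due", "Total", "Balance"]
history = {"Total": 3, "Balance": 1}
ordered = order_by_history(labels, history)

# Labels that matched text elements in a newly ingested document from that source.
history = update_history(history, ["Balance", "Balance", "Total"])
```

Under these assumptions, the label that has worked most often for a given source is tried first on the next document from that source.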

Particular embodiments may comprise one or more of the following features. The processor may be further configured to train a machine learning model correlating text elements with the data detectors they have been associated with, determine whether the machine learning model performs better than one or more data detectors, and/or automatically employ the machine learning model in place of the one or more data detectors once the machine learning model outperforms the one or more data detectors. Determining the source of the document may include identifying all postal addresses in the document based upon an observed format, validating each postal address, placing each postal address in a standard format, and/or comparing each address with a list of addresses unique to each of a plurality of known document sources. Each text element may further include a size. The potential format of each data detector may further include a potential size.

According to another aspect of the disclosure, a system for format-agnostic document ingestion includes a document ingestion server having a processor and a memory. The document ingestion server is communicatively coupled to a database, the processor configured to receive an image of a document, the document including text arranged in an unknown format. The processor is also configured to convert, using optical character recognition, the image of the document into a plurality of text elements, each text element having a content, a size, and an absolute position within the document. The processor is also configured to retrieve a plurality of data detectors from the database, each data detector associated with a data type that is anticipated to be in the document. Each data detector includes at least one identifier that is a potential label, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria. Each validation criteria describes a valid format. The processor is further configured to identify a potential descriptor by comparing the content of each text element with the at least one identifier of at least one data detector, and determine if the text element pointed to by one of the at least one direction of the data detector used to identify the potential descriptor meets the validation criteria of the data detector, and associate the validated text element with the data detector. Finally, the processor is configured to store, for each text element associated with one data detector of the plurality of data detectors, the content of the text element, in the database.
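As a non-limiting illustration, the descriptor, direction, and validation steps of this aspect might be sketched as follows. The `TextElement` fields, the two supported directions, the nearest-neighbor rule, and the sample values are all assumptions made for the example; the disclosure does not prescribe them.

```python
import re
from dataclasses import dataclass


@dataclass
class TextElement:
    content: str
    x: float       # absolute position of the left edge
    y: float       # absolute position of the top edge (grows downward)
    width: float
    height: float


def neighbor(element, elements, direction):
    """Return the nearest element in the given direction, or None if there is none."""
    if direction == "right":
        pool = [e for e in elements
                if e.x > element.x and abs(e.y - element.y) < element.height]
        return min(pool, key=lambda e: e.x - element.x, default=None)
    if direction == "below":
        pool = [e for e in elements
                if e.y > element.y and abs(e.x - element.x) < element.width]
        return min(pool, key=lambda e: e.y - element.y, default=None)
    raise ValueError(f"unsupported direction: {direction}")


def extract(elements, labels, directions, valid_format):
    """Find a label, follow each direction in order, and return the first valid value."""
    for element in elements:
        if element.content.strip().rstrip(":").lower() in labels:
            for direction in directions:
                target = neighbor(element, elements, direction)
                if target is not None and re.fullmatch(valid_format, target.content):
                    return target.content
    return None


# Hypothetical OCR output for a small invoice.
elements = [
    TextElement("Invoice", 10, 10, 60, 12),
    TextElement("Amount Due:", 10, 40, 80, 12),
    TextElement("$125.50", 100, 40, 50, 12),
    TextElement("Thank you", 10, 70, 70, 12),
]
amount = extract(elements, {"amount due"}, ["right", "below"], r"\$\d+\.\d{2}")
```

In this sketch the element to the right of the label is tried first; if it failed validation, the element below the label would be tried next.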

Particular embodiments may comprise one or more of the following features. The processor may be further configured to identify a document type by searching the content of each text element for a plurality of distinguishing strings. Each distinguishing string may be unique to one document type. The plurality of data detectors retrieved from the database may be selected based on the document type. Each identifier may be at least one of a potential label and a potential format. Each validation criteria may describe at least one of a valid format and a valid range. The processor may be further configured to determine a source of the document by comparing the at least one identifier of a data detector associated with a data type that may be unique among potential document sources with the content of each text element of the plurality of text elements, and, for each data detector, may order at least one of the identifiers and the directions according to a history stored in the database and associated with the source. The processor may be further configured to update, for each data detector, the history associated with the source, according to which identifier of the at least one identifier and which direction of the at least one direction matched the most text elements of the data type described by the data detector in the document. The processor may be further configured to identify a table within the document by calculating for each text element of the plurality of text elements a relative position of at least one neighboring text element relative to the text element using the absolute position of the text element, and/or comparing the relative positions of the plurality of text elements. The processor may be configured to locate a header for the table by comparing the content of the text elements within the table with the identifiers of the plurality of data detectors and then identifying the data type of the matching text elements. The header may be one of a row and a column.
The processor may be configured to validate, for each identified text element in the header, at least one text element within the other of a row and a column described by the identified text element in the header with the validation criteria of the data detector that identified the identified text element in the header. Lastly, the processor may be configured to associate, for each identified text element in the header, at least one validated text element within the other of the row and the column described by the identified text element in the header with the data detector that identified the identified text element in the header.
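For illustration only, the table identification, header location, and column validation described above might be sketched as follows. The sketch assumes that the header is a row, that text elements are simple (content, x, y) tuples, and that a fixed pixel tolerance decides alignment; these simplifications and all names are hypothetical.

```python
import re


def group_rows(elements, tolerance=5):
    """Group (content, x, y) tuples into rows of similar y, each sorted left to right."""
    rows = []
    for item in sorted(elements, key=lambda e: e[2]):
        if rows and abs(rows[-1][0][2] - item[2]) <= tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    return [sorted(row, key=lambda e: e[1]) for row in rows]


def find_table(rows, tolerance=5):
    """Return the longest run of consecutive multi-cell rows with aligned x positions."""
    best, current = [], []
    for row in rows:
        xs = [x for _, x, _ in row]
        aligned = (
            current
            and len(xs) > 1
            and len(xs) == len(current[-1])
            and all(abs(x - cell[1]) <= tolerance for x, cell in zip(xs, current[-1]))
        )
        if aligned:
            current.append(row)
        else:
            current = [row] if len(xs) > 1 else []
        if len(current) > len(best):
            best = list(current)
    return best if len(best) >= 2 else []


def read_column(table, label, valid_format):
    """Locate `label` in the header row and return the validated cells beneath it."""
    header = [content.lower() for content, _, _ in table[0]]
    if label not in header:
        return []
    i = header.index(label)
    return [row[i][0] for row in table[1:] if re.fullmatch(valid_format, row[i][0])]


# Hypothetical OCR output: a title, a three-column table, and a trailing line.
elements = [
    ("Statement", 10, 10),
    ("Date", 10, 50), ("Service", 100, 50), ("Charge", 250, 50),
    ("01/02/2020", 10, 70), ("Office visit", 100, 70), ("$80.00", 250, 70),
    ("01/09/2020", 10, 90), ("X-ray", 100, 91), ("$45.00", 250, 90),
    ("Total due: $125.00", 10, 130),
]
table = find_table(group_rows(elements))
charges = read_column(table, "charge", r"\$\d+\.\d{2}")
```

A header arranged as a column could be handled under the same assumptions by swapping the roles of x and y.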

According to yet another aspect of the disclosure, a method for format-agnostic document ingestion includes receiving, by a processor, an image of a document, the document having text arranged in an unknown format. The method also includes converting, using optical character recognition performed by the processor, the image of the document into a plurality of text elements, each text element having a content, a size, and an absolute position within the document. The method further includes retrieving a plurality of data detectors, each data detector associated with a data type that is anticipated to be in the document, each data detector having at least one identifier that is a potential label, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria. Each validation criteria describes a valid format. The method includes identifying a potential descriptor by comparing the content of each text element with the at least one identifier of at least one data detector, determining if the text element pointed to by one of the at least one direction of the data detector used to identify the potential descriptor meets the validation criteria of the data detector, and associating the validated text element with the data detector. Finally, the method includes storing, for each text element associated with one data detector of the plurality of data detectors, the content of the text element.

Particular embodiments may comprise one or more of the following features. The method may also include identifying a document type by searching the content of each text element for a plurality of distinguishing strings, each distinguishing string being unique to one document type. The plurality of data detectors retrieved may be selected based on the document type. Each identifier may be one of a potential label and a potential format. Each validation criteria may describe one of a valid format and a valid range. The method may further include determining a source of the document by comparing the at least one identifier of a data detector associated with a data type that may be unique among potential document sources with the content of each text element of the plurality of text elements. The method may also include ordering, for each data detector, at least one of the identifiers and the directions according to a history associated with the source, and updating, for each data detector, the history associated with the source, according to which identifier of the at least one identifier and which direction of the at least one direction matched the most text elements of the data type described by the data detector in the document. The method may also include identifying a table within the document by calculating for each text element of the plurality of text elements a relative position of at least one neighboring text element relative to the text element using the absolute position of the text element, and comparing the relative positions of the plurality of text elements. The method may further include locating a header for the table by comparing the content of the text elements within the table with the identifiers of the plurality of data detectors and then identifying the data type of the matching text elements. The header may be one of a row and a column.
The method may include validating, for each identified text element in the header, at least one text element within the other of a row and a column described by the identified text element in the header with the validation criteria of the data detector that identified the identified text element in the header, and associating, for each identified text element in the header, at least one validated text element within the other of the row and the column described by the identified text element in the header with the data detector that identified the identified text element in the header. The method may also include training a machine learning model correlating text elements with the data detectors they have been associated with, and/or determining whether the machine learning model performs better than one or more data detectors, and/or automatically employing the machine learning model in place of the one or more data detectors once the machine learning model outperforms the one or more data detectors. Determining the source of the document may include identifying all postal addresses in the document based upon an observed format, validating each postal address, placing each postal address in a standard format, and/or comparing each address with a list of addresses unique to each of a plurality of known document sources.
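As a simple illustration, the document type identification mentioned above, in which each distinguishing string is unique to one document type, might be sketched as follows. The document type names, the distinguishing strings, and the case-insensitive substring comparison are assumptions made for the example.

```python
def identify_document_type(contents, distinguishing_strings):
    """Return the first document type whose distinguishing string appears in any element.

    `distinguishing_strings` is a hypothetical mapping from document type to the
    strings assumed to be unique to that type.
    """
    lowered = [content.lower() for content in contents]
    for document_type, strings in distinguishing_strings.items():
        for string in strings:
            if any(string.lower() in content for content in lowered):
                return document_type
    return None


# Hypothetical distinguishing strings for two document types.
distinguishing_strings = {
    "explanation_of_benefits": ["Explanation of Benefits", "This is not a bill"],
    "shipping_manifest": ["Bill of Lading", "Shipping Manifest"],
}
contents = ["ACME Freight", "SHIPPING MANIFEST", "Date: 01/02/2020"]
document_type = identify_document_type(contents, distinguishing_strings)
```

The resulting document type could then serve as the key used to select which data detectors to retrieve.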

Aspects and applications of the disclosure presented here are described below in the drawings and detailed description. Unless specifically noted, it is intended that the words and phrases in the specification and the claims be given their plain, ordinary, and accustomed meaning to those of ordinary skill in the applicable arts. The inventors are fully aware that they can be their own lexicographers if desired. The inventors expressly elect, as their own lexicographers, to use only the plain and ordinary meaning of terms in the specification and claims unless they clearly state otherwise and then further, expressly set forth the “special” definition of that term and explain how it differs from the plain and ordinary meaning. Absent such clear statements of intent to apply a “special” definition, it is the inventors’ intent and desire that the simple, plain and ordinary meaning to the terms be applied to the interpretation of the specification and claims.

The inventors are also aware of the normal precepts of English grammar. Thus, if a noun, term, or phrase is intended to be further characterized, specified, or narrowed in some way, then such noun, term, or phrase will expressly include additional adjectives, descriptive terms, or other modifiers in accordance with the normal precepts of English grammar. Absent the use of such adjectives, descriptive terms, or modifiers, it is the intent that such nouns, terms, or phrases be given their plain, and ordinary English meaning to those skilled in the applicable arts as set forth above.

Further, the inventors are fully informed of the standards and application of the special provisions of 35 U.S.C. § 112(f). Thus, the use of the words “function,” “means” or “step” in the Detailed Description or Description of the Drawings or claims is not intended to somehow indicate a desire to invoke the special provisions of 35 U.S.C. § 112(f), to define the invention. To the contrary, if the provisions of 35 U.S.C. § 112(f) are sought to be invoked to define the inventions, the claims will specifically and expressly state the exact phrases “means for” or “step for”, and will also recite the word “function” (i.e., will state “means for performing the function of [insert function]”), without also reciting in such phrases any structure, material or act in support of the function. Thus, even when the claims recite a “means for performing the function of ...” or “step for performing the function of ...,” if the claims also recite any structure, material or acts in support of that means or step, or that perform the recited function, then it is the clear intention of the inventors not to invoke the provisions of 35 U.S.C. § 112(f). Moreover, even if the provisions of 35 U.S.C. § 112(f) are invoked to define the claimed aspects, it is intended that these aspects not be limited only to the specific structure, material or acts that are described in the preferred embodiments, but in addition, include any and all structures, materials or acts that perform the claimed function as described in alternative embodiments or forms of the disclosure, or that are well known present or later-developed, equivalent structures, material or acts for performing the claimed function.

The foregoing and other aspects, features, and advantages will be apparent to those artisans of ordinary skill in the art from the DESCRIPTION and DRAWINGS, and from the CLAIMS.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosure will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:

FIG. 1 is a network view of a format-agnostic document ingestion system;

FIG. 2 is a schematic view of the contents of a database belonging to a format-agnostic document ingestion system;

FIG. 3 is a schematic view of an exemplary document;

FIG. 4 is a schematic flow for document ingestion using a format-agnostic document ingestion system;

FIG. 5 is a process flow for a method for format-agnostic document ingestion; and

FIG. 6 is a schematic view of a specific computing device that can be used to implement the methods and systems disclosed herein.

DETAILED DESCRIPTION

This disclosure, its aspects and implementations, are not limited to the specific material types, components, methods, or other examples disclosed herein. Many additional material types, components, methods, and procedures known in the art are contemplated for use with particular implementations from this disclosure. Accordingly, for example, although particular implementations are disclosed, such implementations and implementing components may comprise any components, models, types, materials, versions, quantities, and/or the like as is known in the art for such systems and implementing components, consistent with the intended operation.

The word “exemplary,” “example,” or various forms thereof are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” or as an “example” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Furthermore, examples are provided solely for purposes of clarity and understanding and are not meant to limit or restrict the disclosed subject matter or relevant portions of this disclosure in any manner. It is to be appreciated that a myriad of additional or alternate examples of varying scope could have been presented, but have been omitted for purposes of brevity.

While this disclosure includes a number of embodiments in many different forms, there is shown in the drawings and will herein be described in detail particular embodiments with the understanding that the present disclosure is to be considered as an exemplification of the principles of the disclosed methods and systems, and is not intended to limit the broad aspect of the disclosed concepts to the embodiments illustrated.

Although there is increasing interest in moving various workflows and processes to be entirely electronic, a great deal of interactions between companies and individuals involves human-readable documents, either “analog” (e.g. paper documents) or digital (e.g. PDF documents). Even among documents of the same type, such as shipping manifests for freight delivery services, each party will execute that task in a slightly different manner, resulting in everyone having a different document format.

This has been a long-standing problem, and is one that complicates efforts to ingest a large number of documents, whether to digitize old physical records or to gather documents from various sources. While these tasks may be accomplished manually on a small scale, those methods do not scale well, and quickly get bogged down when trying to coordinate multiple document formats.

Conventional solutions to this problem have relied on systems that understand the various document formats. However, these conventional systems require training, and are fragile. A small change in document format can throw such systems into chaos. While slightly more scalable than manual effort, conventional solutions require a level of preparation and upkeep that often negates most of the efficiency of the automation.

Contemplated herein is a system and method for format-agnostic document ingestion. The ability to ingest documents independent of how they are formatted allows the automation and/or scaling of processes that previously had to be done manually, or automated only after great expense, time, and effort from humans.

Advantageous over conventional systems and methods, the format-agnostic document ingestion systems and methods contemplated herein are able to operate without needing to know how the documents are formatted. Conventional systems must rely on either manual entry of these documents, or fragile automated systems that must be manually taught every new document format. Format updates can require conventional systems to stop operating while the new format is trained. The systems contemplated herein are able to continue operating, even through many rapid format changes.

The format-agnostic document ingestion systems and methods contemplated herein facilitate the interaction of two organizations, each of which may have a particular way of communicating and storing data. Removing document formatting from the mix allows these organizations to work cooperatively without having to alter their own systems or workflows, which are often built up over a long time span, at great cost. Conventional systems created to facilitate these interactions would need to be configured for the particular needs and methods of the organizations; any change on either end would require more configuration. The systems and methods contemplated herein can accomplish the task with greater efficiency, little to no initial configuration, and even less configuration to deal with format changes. Additional use cases will be discussed below.

In the context of the present description and the claims that follow, format-agnostic means the system is able to extract desired textual data from a document independent of how that data is organized within the document, how it is labeled (e.g. how it is described, where the labels are with respect to the information, etc.), or how it is displayed. Furthermore, in the context of the present description and the claims that follow, ingestion refers to the capture of information from a document, placing it in an electronic form that increases its utility (e.g. placement into a universally recognized data format, making paper records digital, etc.).

It should be noted that, while the discussion of systems and methods below will mainly be done in the context of ingesting a paper document presented to the system as an image, these systems and methods may be applied to documents existing in a wide range of states. For example, in some embodiments, the system may receive a document that is electronic in form, but has been formatted for human consumption (e.g. a formatted PDF of an invoice, etc.). As another example, a document may be provided to the system in an electronic machine-readable form (e.g. a database record, CSV file, etc.), but may make use of an unknown labeling or organizational method (e.g. some people organize a set of data in rows, while others do it in columns, etc.). Additional examples will be provided below.

FIG. 1 is a network view of a non-limiting example of a format-agnostic document ingestion system 100. As shown, the system 100 comprises a document ingestion server 102. Going forward, reference is made to a user 124 of the system 100. In the context of the present description and the claims that follow, a user 124 is an individual interfacing or interacting with the system 100. They have access to the system and are able to perform the same operations as a corporate third party who may interact with the server 102 over a network 114.

As shown, the system 100 includes a document ingestion server 102 having a processor 118 and a memory 120. The server 102 is responsible for collecting the documents 108 (e.g. bills, invoices, reviews, etc.). In some embodiments, the server 102 may be a discrete piece of hardware, while in others it may be a distributed computing environment spread across multiple machines. In some embodiments, the server 102 may be implemented in a cloud environment (e.g. the functionality described may be provided in an instantiated environment implemented on remote hardware, etc.). In other embodiments the server 102 may operate remotely and be offered over the network as software-as-a-service (SaaS).

In some embodiments, the server 102 may be communicatively coupled to a database 104 through a network 114, the database 104 also being a part of the system 100. In some embodiments, the database 104 may be localized with (e.g. internal to, etc.) the server 102. In other embodiments, the database 104 may be distinct from the server 102. In still other embodiments, the database 104 may be remote to the server 102 (e.g. executed in a cloud environment, etc.). The database 104 may be used to store various information used by the system 100, including but not limited to user profiles and preferences, bills, explanations of benefits, financial records, payment methods, payment histories, biographical information, and the like. The database 104 may be implemented in any architecture known in the art, such as SQL, NoSQL, and the like. It should be noted that there are also embodiments where the system 100 does not have a database 104, and all of the storage and retrieval operations discussed below are performed in the memory 120 of the server 102.

Users 124 may interact with the server 102 through a client device 122 (e.g. phone, tablet, laptop computer, desktop computer, etc.). According to various embodiments, the user-server interaction may be accomplished through various interfaces, including but not limited to, a web portal, a specialized app or application, and the like.

According to various embodiments, a document 108 may be submitted to the server 102 in the form of an image (see image 400 of FIG. 4). In some embodiments, a user 124 may submit a document 108 by capturing an image 400 using a camera on their client device 122 (e.g. smartphone, etc.). In other embodiments, the document 108 may be captured using a capture device 116. In the context of the present description and the claims that follow, a capture device 116 is an imaging device that is designed for imaging documents. Examples include, but are not limited to, flatbed scanners, sheet-fed scanners, book scanners, and the like.

According to various embodiments, the server 102 may receive some electronic representation of the document 108 through one or more channels. In some embodiments, an image 400 may be submitted to the server 102 through an application interface. As a specific example, in one embodiment, a user 124 may capture an image 400 of the document 108 using the camera on their smartphone client device 122, and thereafter submit it to the server 102 using an app loaded on the phone.

In other embodiments, electronic documents may be sent to the server 102 using the same channels they were received through. For example, in one embodiment, an electronic document 108 received by a user 124 in an email message may be submitted to the server 102 by simply forwarding the email. As an option, the system 100 may provide the user 124 an email address that is unique to them, to which they can send items for ingestion and association with them or an entity they represent. In still other embodiments, the server 102 may allow for the uploading of documents using authenticated and secure protocols and methods (e.g. FTP, SFTP, etc.) as is known in the art.

In some embodiments, the server 102 may also be configured to interact with one or more production servers 112. In the context of the present description and the claims that follow, a production server 112 is a server that is associated with the production of a document 108. In some embodiments, it is a server affiliated with the party who produced a document 108, while in others, the production server 112 itself generated the document 108. As will be discussed in greater detail below, it may be advantageous to be able to determine the source of a document 108.

In other embodiments, the server 102 may also communicate with third party servers 110 for interactions that do not involve obtaining a document 108 or returning a result. For example, in some embodiments, the third party server 110 may be contacted to validate some information obtained by the server 102 during the ingestion process (e.g. determine whether an address exists, whether a credit card number is valid, etc.). In other embodiments, the third party server 110 may be used to make a determination based, at least in part, on information obtained from an ingested document.

FIG. 2 is a schematic view of a non-limiting example of a database 104belonging to a format-agnostic document ingestion system 100. As shown,the database 104 may be used to store a variety of informationincluding, but not limited to, document records 236 as well asinformation used to ingest documents provided by a user 124. It shouldbe noted that the data objects depicted in FIG. 2 are meant to providecontext for a discussion of the various pieces of data that the system100 works with, and is not meant to be limiting or imply the requirementof a particular structure or storage format. Those skilled in the artwill recognize that there are numerous ways the following informationmay be organized, stored, and accessed within a database.

The database 104 may store a plurality of data detectors 200. In the context of the present description and the claims that follow, a data detector 200 is a collection of data that is used to identify and validate particular pieces of information among the text extracted from a document 108.

As shown, each data detector 200 is associated with a data type 202. In the context of the present description and the claims that follow, a data type 202 refers to a specific piece of data (e.g. date of invoice, type of healthcare service rendered, etc.) rather than the species of data (e.g. date, string, etc.).

Each data detector 200 comprises at least one identifier 204, which is a piece of information that can be used to identify an instance of the data type 202 within a document 108. One example of an identifier is a potential label 206, which is a textual indicator identifying a nearby piece of information (e.g. “Payment due”, “ID number”, etc.).

Another example of an identifier is a potential format 208, meaning a patterning of kinds of data (e.g. a dollar sign followed by numbers and a period and two more numbers to indicate a monetary value, etc.). Examples of potential format 208 identifiers include, but are not limited to, a potential size 210 (e.g. a size of the element relative to the document in which it is found, the absolute size of an element, etc.), a regular expression 212 (e.g. a pattern defined within the data detector 200 that matches particular characters or classes of characters arranged in a particular manner, etc.), and a potential font 214.

In some embodiments, a data detector 200 may require the observation of both a label 206 and a format 208, while in others it may be limited to one or the other. In some embodiments, a data detector 200 may have more than one identifier 204, for information that may be labeled in numerous ways. For example, an identification number could have the labels “ID number”, “ID No.”, “ID”, and the like.

As shown, the data detector 200 also comprises at least one direction 216, meaning one or more potential relative directions 218 where a value may be found, with respect to its identifier. For example, a data detector 200 having a potential label 206 of “Statement Date” may include “right” as a direction 216. Those skilled in the art will recognize that relative direction information may be represented in numerous ways, including verbal (e.g. “right”, “up”, etc.) and numerical (e.g. Cartesian offset, polar coordinates, etc.).

According to various embodiments, the data detector 200 may also include at least one validation criteria 220, which must be satisfied in order to accept a piece of text content as a value being described by the identifier 204. Examples include, but are not limited to, valid formats 222 (e.g. two decimal places for dollar amounts, etc.) and valid ranges 224 (e.g. date of birth within the last 120 years, etc.).
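As a non-limiting illustration, a data detector 200 could be represented as a small record holding its data type, labels, directions, and validation checks. The following Python sketch is one possible arrangement only; the class name, field names, and the dollar-amount pattern are assumptions made for this example and are not prescribed by the present description.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DataDetector:
    """Hypothetical record for one data detector (names are illustrative)."""
    data_type: str                      # e.g. "total amount due"
    labels: List[str]                   # potential labels (identifiers)
    directions: List[str]               # potential relative directions
    validators: List[Callable[[str], bool]] = field(default_factory=list)

    def matches_label(self, content: str) -> bool:
        # Compare case-insensitively, ignoring a trailing colon.
        text = content.strip().rstrip(":").lower()
        return any(text == label.lower() for label in self.labels)

    def is_valid(self, content: str) -> bool:
        # Every validation criterion must be satisfied.
        return all(check(content) for check in self.validators)


def is_dollar_amount(text: str) -> bool:
    # Assumed valid format: dollar sign, digits, exactly two decimal places.
    pattern = r"\$(\d{1,3}(,\d{3})*|\d+)\.\d{2}"
    return re.fullmatch(pattern, text.strip()) is not None


amount_due = DataDetector(
    data_type="total amount due",
    labels=["Total amount due", "Amount due"],
    directions=["right", "down"],
    validators=[is_dollar_amount],
)
```

In this sketch a label match alone is not sufficient; a candidate value is accepted only if every validator returns true.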

The database 104 also may comprise one or more history records 226. In the context of the present description and the claims that follow, a history 226 is a collection of observations made during previous document ingestion that are specific to a particular document source (e.g. healthcare provider, bank, etc.) and/or particular document type (e.g. bill, receipt, check, credit card, insurance card, manifest, etc.). It should be noted that a history 226 is different than a recording of a particular document format. Instead, it provides a preferred ordering for the various potential relative directions 218 and/or identifiers 204 that may be observed by that data detector 200. For example, in one embodiment, if the first medical bill from North Hospital ingested by the server 102 positioned the dollar value for the total amount due to the right of the “total amount due” label, the history 226 for North Hospital bills would place “right” at the front of a list of possible directions for the data detector 200 for that label. The use of histories 226 is advantageous as it permits optimized operation based on previous documents without locking the system 100 into a particular document format, as is done with conventional systems.
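One way to picture a history 226 is as a tally of which directions succeeded for a given source, used only to reorder the candidate directions of a data detector. The Python sketch below assumes a simple per-direction counter; the function names and the counting scheme are illustrative assumptions, not a description of any required implementation.

```python
from typing import Dict, List


def record_observation(history: Dict[str, int], direction: str) -> None:
    # Note that a value was found in this direction for this source.
    history[direction] = history.get(direction, 0) + 1


def apply_history(directions: List[str], history: Dict[str, int]) -> List[str]:
    # Most frequently observed directions come first. sorted() is stable,
    # so directions never observed keep their default relative order.
    return sorted(directions, key=lambda d: -history.get(d, 0))


# Hypothetical history for one source after a single ingested bill.
north_hospital_history: Dict[str, int] = {}
record_observation(north_hospital_history, "right")
ordered = apply_history(["down", "right", "left"], north_hospital_history)
```

Because the history only changes the order in which directions are tried, an unexpected layout still falls through to the remaining directions rather than failing outright.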

As shown, the database 104 may also comprise a plurality of document records 236, according to some embodiments, each comprising data regarding a particular document 108. It should be noted that said document 108 is not limited to tangible documents ingested into the system, but also includes electronic, human readable documents (e.g. a PDF of an explanation of benefits, etc.), digital files read for storage in the database 104, and the like.

As shown, according to various embodiments, the document record 236 may comprise a document type 228 (e.g. bill, receipt, manifest, etc.), and other ingested content 240 (e.g. data ingested from the document 108 that has been verified by the user 124 or some other entity or process, etc.). In some embodiments, the information obtained from a processed image 400 may be stored, but the image deleted or downgraded in resolution to preserve storage space. In other embodiments, the image 400 may be stored in the database 104, either as part of the document record 236 or as a separate record linked to the document record 236.

The database 104 may comprise information about various document sources, to facilitate the determination of the source of various documents 108 ingested into the server 102. As will be discussed in greater detail below with respect to FIG. 4, in some embodiments, the database 104 may comprise a list 234 of postal addresses that are unique to known document sources. In other embodiments, additional or alternative identifying information may be stored and used in similar manner.

In some embodiments, the methods discussed thus far may be used as a stepping stone toward automation with artificial intelligence or machine learning. Conventional document management and classification systems that use machine learning or artificial intelligence can be effective, but at the cost of painstaking, human-driven training and model refinement. Contemplated herein is a system 100 that not only performs better than conventional systems, but is also able to train its own potential replacement.

According to various embodiments, a machine learning model 232 may be trained to correlate text elements (see text elements 402 of FIG. 4) with the data detectors 200 they have been associated with by the server 102. Periodically, as this model 232 continues to train while the server 102 ingests more documents, the server 102 determines if the machine learning model 232 performs better than one or more of the data detectors 200 being modeled. If the model performs better, the server 102 may automatically employ the machine learning model 232 in place of the one or more data detectors, once they fall behind in performance.

FIG. 3 is a schematic view of a non-limiting example of a document 108. Specifically, FIG. 3 shows a non-limiting example of an explanation of benefits 300 with an unknown format 310. It should be noted that the use of an explanation of benefits in this non-limiting example is not meant to imply that the systems 100 and methods contemplated herein are bound to any particular industry, or any particular kind of document. These systems and methods may be adapted for use with any document containing information. As will be discussed in greater detail with respect to FIG. 4, each document has information that is identifiable by position and/or a label 306. Until a data detector 200 is employed, each piece of text is a potential descriptor 308.

The system 100 is able to distinguish between two or more document types, according to various embodiments. This is particularly beneficial for document types that may otherwise be confused by a user 124. For example, in some embodiments, the system 100 is able to differentiate between an explanation of benefits and a medical bill, which sometimes can look very similar. In some embodiments, the system 100 is able to distinguish between document types by searching for a distinguishing string 312, which is a piece of text that is reliably unique to a particular document type. For example, explanations of benefits often have “This is not a bill” printed on them somewhere.

Many documents have a source 230 identified, indicating where the document 108 came from or who produced it. Many documents also have a postal address 314 for the source 230, provided in an observed postal address format 316.

As shown, an explanation of benefits 300 may include, but is not limited to, patient name, a source 230, a total amount due, a balance due date, a date of service, service details, a statement date, and/or an account number. Explanations of benefits 300 may also include a payment amount and a payment due date, for bills with an established payment plan.

FIG. 4 is a schematic flow for a non-limiting example of document ingestion using a format-agnostic document ingestion system 100. Format-agnostic document ingestion provides an advantage over conventional, rigid format based systems or non-scalable, expensive manual systems. Being format-agnostic allows the system 100 to quickly incorporate documents from previously unknown sources or in previously unknown formats, which is time consuming and costly for conventional systems to deal with. This also allows the system 100 to work with great efficiency, without requiring document sources to make any changes to the way they operate.

It should be noted that this process is intended for ingesting documents 108 that were intended for human consumption. Documents or files that are computer formatted (e.g. data provided as arrays and matrices, some data structure having indisputable relationships between entries rather than contextual relationships on a two dimensional surface, etc.) may be ingested using much more streamlined methods that may include some form of validation and label/format normalization.

First, the system 100 receives an image 400 of a document 108. See circle ‘1’. The image 400 may be a digital photograph of the document 108 captured with the client device 122, a high resolution scan of the document 108 obtained with the capture device 116, or it may be an electronic version of a human-formatted document (e.g. a PDF of an explanation of benefits, etc.), or some other visual format.

Next, the image 400 is converted into a plurality of text elements 402 using optical character recognition or machine vision to identify what is text. See circle ‘2’. Characters are grouped together as words and sentences using various attributes that may include, but are not limited to, kerning, spacing, alignment, font, and the like. These groupings are turned into individual text elements 402.

In the context of the present description and the claims that follow, a text element 402 is a data object that comprises a content 404 (e.g. the characters, etc.) and an absolute position 408 of that content within the document (e.g. Cartesian coordinates + page number, etc.). In some embodiments, each text element 402 may also include a size 406 of the text, or other visual attributes.
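By way of illustration only, a text element 402 might be modeled as an immutable record holding its content, position, and optional size. In the Python sketch below, the field names, the coordinate convention (y increasing down the page), and the `reading_order` helper are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextElement:
    """Hypothetical text element; field names are illustrative."""
    content: str                   # the recognized characters
    x: float                       # horizontal position on the page
    y: float                       # vertical position, increasing downward
    page: int = 1                  # page number within the document
    size: Optional[float] = None   # text size, if the OCR step reports it


def reading_order(elements: List[TextElement]) -> List[TextElement]:
    # Order by page, then top to bottom, then left to right.
    return sorted(elements, key=lambda e: (e.page, e.y, e.x))


elements = [
    TextElement("$120.00", x=200, y=50),
    TextElement("Total amount due", x=10, y=50, size=12.0),
    TextElement("Statement Date", x=10, y=20),
]
ordered_contents = [e.content for e in reading_order(elements)]
```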

According to some embodiments, the system 100 determines what type of document 108 is being ingested. See circle ‘3’. In some embodiments, the document type may be provided by a user 124 when they capture an image of the document 108. In other embodiments, the system 100 may automatically differentiate between two or more possible document types using various methods.

In some embodiments, the system 100 may differentiate between a number of potential document types by searching the content 404 of each text element 402 in the image 400 for at least one distinguishing string 312 that is reliably unique to one particular document type. As a specific example, medical bills from a hospital can sometimes be difficult to distinguish from explanations of benefits from an insurance provider, as they both contain much of the same information. However, the explanation of benefits often includes the words “This is not a bill”; searching for that content 404 may allow the system 100 to determine if the image 400 is a bill or an explanation of benefits.
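A search for a distinguishing string 312 can be sketched as a case-insensitive, whitespace-tolerant substring test over the content of each text element. The Python example below is a minimal sketch; the function name, the mapping of document types to phrases, and the fallback value are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List


def classify_document(
    contents: Iterable[str],
    distinguishing_strings: Dict[str, List[str]],
    default: str = "unknown",
) -> str:
    # Normalize case and collapse whitespace, since OCR output may vary.
    normalized = [" ".join(text.lower().split()) for text in contents]
    for doc_type, phrases in distinguishing_strings.items():
        for phrase in phrases:
            if any(phrase in text for text in normalized):
                return doc_type
    return default


# Hypothetical table of distinguishing strings per document type.
DISTINGUISHING = {"explanation of benefits": ["this is not a bill"]}

doc_type = classify_document(
    ["Statement Date", "THIS IS  NOT A BILL", "$120.00"], DISTINGUISHING
)
```

In this sketch a document with no distinguishing string is reported as "unknown"; a fuller arrangement might instead fall back to a default type such as a bill, or to the structural features discussed next.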

In some embodiments, the system 100 may determine what kind of document 108 is being ingested using structural features. For example, in one embodiment, the system 100 may use the size 320 of the document 108 to differentiate between a card sized document and a letter sized document. This determination may be made by comparing the size 406 of the text elements 402 with the relative size 320 of the document 108 in the image 400, allowing the system 100 to conclude if the image 400 is of a card or a letter sized document. Other document types may be determined using visual features that are specific and common to that type (e.g. the particular OCR font used to print the serial number along the bottom left corner of a check, etc.).

Next, in some embodiments, the system 100 determines the identity of the document source 230 (e.g. store, insurance company, government, etc.) that created the document 108 being considered. Some embodiments approach this task using postal addresses 314, which have a predictable observed format 316 and are relatively easy to identify.

First, each element 402 is examined to determine if it contains a postal address 314. See circle ‘4’. After locating an address, it is placed in a standard postal address format 318 to facilitate comparison with a list 234 of addresses unique to known document sources. If the address matches an entry on the list 234, the source 230 has been identified. If none of the addresses found in the image 400 are on the list 234, then the system 100 examines the elements 402 that are closest to the found addresses, seeking the source’s name. The name may be identified using various methods, including but not limited to comparison with a list of potential sources, a comparison of size and/or formatting of an element closest to the address with the majority of the other elements (e.g. source name is likely to be visually distinct, etc.), and the like.
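The address comparison described above can be sketched as a normalization step followed by a dictionary lookup. The Python example below is illustrative only: the abbreviation table, the normalization rules, the sample address, and the function names are assumptions, and a practical system would likely rely on a fuller postal standard.

```python
import re
from typing import Dict, Iterable, Optional

# Small, assumed abbreviation table; a real standard format would be larger.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "suite": "ste"}


def normalize_address(raw: str) -> str:
    # Lowercase, replace punctuation with spaces, abbreviate common words,
    # and collapse whitespace (including line breaks) into single spaces.
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())


def identify_source(
    found_addresses: Iterable[str], known_sources: Dict[str, str]
) -> Optional[str]:
    # known_sources maps a normalized address to a source name.
    for raw in found_addresses:
        source = known_sources.get(normalize_address(raw))
        if source is not None:
            return source
    return None  # caller falls back to examining nearby elements


# Hypothetical list of addresses unique to known sources.
known = {normalize_address("100 Main St, Springfield, IL 62701"): "North Hospital"}
source = identify_source(["100 Main Street\nSpringfield, IL  62701"], known)
```

A return value of None corresponds to the case where no found address is on the list, at which point the elements nearest the address would be examined for the source's name.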

Once the document source 230 has been identified, a plurality of data detectors 200 may be retrieved from the database 104. See circle ‘5’. The data detectors 200 are selected based on the data types 202 that are anticipated to be in the document, and may also be chosen based, at least in part, on the document type (e.g. a first plurality of data detectors 200a may be chosen for an order form and a second plurality of data detectors 200b may be chosen for an invoice, etc.).

In some embodiments, a history 226 specific to that source 230 may also be retrieved from the database 104. Using the history 226, one or more data detectors 200 may be configured, which includes but is not limited to changing the order of the various identifiers 204 and/or directions 216 to reflect the previous observations, as was discussed above.

Once the data detectors 200 have been retrieved and configured, the content of the document 108 may be ingested. In some embodiments, the system 100 may begin with the ingestion of any tables 410 in the document. See circle ‘6’.

According to various embodiments, a table 410 is identified by calculating, for each element 402, a relative position of at least one neighboring text element 414 using the absolute positions 408 recorded for each element 402. Using these relative positions, elements that are arranged in rows and columns will be apparent.
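One simplified way to make rows and columns "apparent" from absolute positions is to group elements whose vertical positions nearly coincide into rows, then keep runs of consecutive rows whose cells line up horizontally. The Python sketch below uses that approach as an assumption; the tuple layout, tolerance value, and function names are illustrative, and this is not presented as the specific neighbor calculation used by the system 100.

```python
from typing import List, Tuple

Element = Tuple[str, float, float]  # (content, x, y), y increasing downward


def group_rows(elements: List[Element], tolerance: float = 2.0) -> List[list]:
    # Elements whose y values are within the tolerance share a row.
    rows: List[dict] = []
    for content, x, y in sorted(elements, key=lambda e: (e[2], e[1])):
        if rows and abs(rows[-1]["y"] - y) <= tolerance:
            rows[-1]["cells"].append((x, content))
        else:
            rows.append({"y": y, "cells": [(x, content)]})
    # Within each row, order cells left to right.
    return [sorted(row["cells"]) for row in rows]


def find_table(elements: List[Element], tolerance: float = 2.0) -> List[List[str]]:
    # Longest run of consecutive multi-cell rows with aligned columns.
    best: List[List[str]] = []
    current: List[List[str]] = []
    current_cols: List[float] = []
    for cells in group_rows(elements, tolerance):
        cols = [x for x, _ in cells]
        contents = [c for _, c in cells]
        aligned = len(cols) == len(current_cols) and all(
            abs(a - b) <= tolerance for a, b in zip(cols, current_cols)
        )
        if len(cols) >= 2 and current and aligned:
            current.append(contents)
        elif len(cols) >= 2:
            current, current_cols = [contents], cols
        else:
            current, current_cols = [], []
        if len(current) > len(best):
            best = list(current)
    return best if len(best) >= 2 else []


page = [
    ("Patient name", 10, 10),
    ("Service", 10, 50), ("Cost", 200, 50),
    ("X-ray", 10, 70), ("$120.00", 200, 70),
    ("Lab work", 10, 90), ("$45.50", 200, 91),
]
table = find_table(page)
```

The first row of the result is only a candidate header at this point; confirming it is a separate step.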

Next, a header 416 is located, the header 416 being a row 418 or column 420 that contains labeling information for a portion of the table 410. The header 416 may be located by comparing the content of the potential table entries along the borders with the content in the center, according to some embodiments.

Once a potential header 416 is located, it may be confirmed by validating the content of at least one element in the row or column represented by the header element. For example, if the potential header element has the content “Cost”, validation may determine if the elements represented by that header element conform to the validation criteria 220 of the data detector 200 for that particular “Cost”.

Upon validation of the header and the data it represents, the remaining validated elements 402 of the table 410 are associated with the various data detectors 200 that are appropriate for that particular data type (e.g. the detectors 200 used to identify and validate the header elements, etc.).

Next, the remaining elements 402 (i.e. the elements that are not in any tables 410) are examined by first identifying a potential descriptor 308 (i.e. a potential label 306) by comparing the element content with various data detectors. Upon finding a potential descriptor 308, it is determined if the element pointed to by one of the at least one direction 216 of the data detector 200 used to identify the potential descriptor 308 meets the validation criteria 220 of the data detector 200. See circle ‘7’. The validated text elements 402 are then associated with the various appropriate data detectors 200, meaning their content is noted to be of the data type represented by that data detector used to validate.
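The label, direction, and validation sequence can be sketched as follows. This Python example is a minimal illustration under stated assumptions: elements are plain (content, x, y) tuples, only the "right" and "down" directions are handled, the alignment tolerance is arbitrary, and the function names and date pattern are hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

Element = Tuple[str, float, float]  # (content, x, y), y increasing downward


def neighbor(label: Element, elements: List[Element], direction: str,
             tolerance: float = 5.0) -> Optional[Element]:
    # Nearest element in the given direction, roughly aligned with the label.
    _, lx, ly = label
    if direction == "right":
        found = [e for e in elements if abs(e[2] - ly) <= tolerance and e[1] > lx]
        return min(found, key=lambda e: e[1] - lx, default=None)
    if direction == "down":
        found = [e for e in elements if abs(e[1] - lx) <= tolerance and e[2] > ly]
        return min(found, key=lambda e: e[2] - ly, default=None)
    return None  # other directions are omitted from this sketch


def extract(elements: List[Element], labels: List[str], directions: List[str],
            is_valid: Callable[[str], bool]) -> Optional[Tuple[str, str]]:
    wanted = {label.lower() for label in labels}
    for element in elements:
        if element[0].strip().rstrip(":").lower() not in wanted:
            continue  # not a potential descriptor for this detector
        for direction in directions:
            candidate = neighbor(element, elements, direction)
            if candidate is not None and is_valid(candidate[0]):
                return candidate[0], direction  # validated value, direction used
    return None


def is_date(text: str) -> bool:
    # Assumed valid format for this example: MM/DD/YYYY.
    return re.fullmatch(r"\d{2}/\d{2}/\d{4}", text) is not None


page = [("Statement Date:", 10, 10), ("Account", 10, 30), ("01/15/2020", 150, 10)]
result = extract(page, ["Statement Date"], ["down", "right"], is_date)
```

Here the element below the label fails validation, so the next direction is tried; the direction that succeeded is returned so that it could be recorded in a history for the source.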

Next, in some embodiments, the content 404 of the validated text elements 402 is stored in the database 104, organized into a document record 236 according to the associations made with the data detectors 200. See circle ‘8’. Finally, for each data detector 200 that matched, the history 226 associated with the source (i.e. source 230) is updated according to which identifier 204 and/or direction 216 matched the most text elements of that data type 202 in the document 108. See circle ‘9’. This advantageously allows the system 100 to continue to operate with efficiency and agility, able to adapt to a change in document format from that source 230 after a single ingestion, without being stymied by the new formatting like most conventional systems.

FIG. 5 is a process flow for a non-limiting example of a method for format-agnostic document ingestion. As shown, the method 500 includes receiving, by a processor, an image of a document, the document comprising text arranged in an unknown format (step 502), and then converting (step 504), using optical character recognition performed by the processor, the image of the document into a plurality of text elements, each text element comprising a content, a size, and an absolute position within the document.

In some embodiments, the server retrieves a plurality of data detectors, each data detector associated with a data type that is anticipated to be in the document, each data detector comprising at least one identifier that is a potential label, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria, wherein each validation criteria describes a valid format (step 506).

The method 500 further includes identifying a potential descriptor (e.g. data label) by comparing the content of each text element with the at least one identifier of at least one data detector (step 508). It is then determined if the text element pointed to by one of the at least one direction of the data detector used to identify the potential descriptor meets the validation criteria of the data detector (step 510). Finally, the method includes associating the validated text element with the data detector (step 512) and storing, for each text element associated with one data detector of the plurality of data detectors, the content of the text element (step 514). In some embodiments, the content may be stored in a database, while in others it may be stored in a different form of electronic storage, or even transmitted over a network.

Applications for the systems and methods contemplated herein span a number of different industries including, but not limited to, data science (e.g. automated creation of training data sets for AI/ML models, etc.), accounting and finance (e.g. the ingestion and sorting of a number of tax related documents such as receipts, invoices, manifests, and the like, etc.), retail (e.g. consolidation of wine tech sheets having a highly variable format with non-standard content, etc.), and healthcare (e.g. management of bills and insurance claims, etc.).

In some embodiments, the systems and methods contemplated herein may be used to allow one party to interface with the legacy system of another, or to interface a legacy system with a bleeding edge system, using these methods to bridge the gap in data and document formats between the different parties involved.

FIG. 6 is a schematic diagram of a specific computing device 600 and a specific mobile computing device 650 that can be used to perform and/or implement any of the embodiments disclosed herein. In one or more embodiments, document ingestion server 102, production server 112, database 104, third party server 110, client device 122 and/or capture device 116 of FIG. 1 may be the specific computing device 600. Furthermore, in one or more embodiments, client device 122 and/or capture device 116 of FIG. 1 may be the specific mobile computing device 650.

The specific computing device 600 may represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and/or other appropriate computers. The specific mobile computing device 650 may represent various forms of mobile devices, such as smartphones, camera phones, personal digital assistants, cellular telephones, and other similar mobile devices. The components shown here, their connections, couples, and relationships, and their functions, are meant to be exemplary only, and are not meant to limit the embodiments described and/or claimed, according to one embodiment.

The specific computing device 600 may include a processor 602, a memory 604, a storage device 606, a high speed interface 608 coupled to the memory 604 and a plurality of high speed expansion ports 610, and a low speed interface 612 coupled to a low speed bus 614 and the storage device 606. In one embodiment, each of the components heretofore may be inter-coupled using various buses, and may be mounted on a common motherboard and/or in other manners as appropriate. The processor 602 may process instructions for execution in the specific computing device 600, including instructions stored in the memory 604 and/or on the storage device 606 to display graphical information for a GUI on an external input/output device, such as a display unit 616 coupled to the high speed interface 608, according to one embodiment.

In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and/or types of memory. Also, a plurality of specific computing devices 600 may be coupled together, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, and/or a multi-processor system).

The memory 604 may be coupled to the specific computing device 600. In one embodiment, the memory 604 may be a volatile memory. In another embodiment, the memory 604 may be a non-volatile memory. The memory 604 may also be another form of computer-readable medium, such as a magnetic and/or an optical disk. The storage device 606 may be capable of providing mass storage for the specific computing device 600. In one embodiment, the storage device 606 may include a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory and/or other similar solid state memory device. In another embodiment, the storage device 606 may be an array of the previously mentioned devices in a computer-readable medium, including devices in a storage area network and/or other configurations.

A computer program may be comprised of instructions that, when executed, perform one or more methods, such as those described above. The instructions may be stored in the memory 604, the storage device 606, a memory coupled to the processor 602, and/or a propagated signal.

The high speed interface 608 may manage bandwidth-intensive operations for the specific computing device 600, while the low speed interface 612 may manage lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one embodiment, the high speed interface 608 may be coupled to the memory 604, the display unit 616 (e.g., through a graphics processor and/or an accelerator), and to the plurality of high speed expansion ports 610, which may accept various expansion cards.

In one embodiment, the low speed interface 612 may be coupled to the storage device 606 and the low speed bus 614. The low speed bus 614 may be comprised of a wired and/or wireless communication port (e.g., a Universal Serial Bus (“USB”), a Bluetooth® port, an Ethernet port, and/or a wireless Ethernet port). The low speed bus 614 may also be coupled to a scan unit 628, a printer 626, a keyboard, a mouse 624, and a networking device (e.g., a switch and/or a router) through a network adapter.

The specific computing device 600 may be implemented in a number of different forms, as shown in the figure. In one embodiment, the specific computing device 600 may be implemented as a standard server 618 and/or a group of such servers. In another embodiment, the specific computing device 600 may be implemented as part of a rack server system 622. In yet another embodiment, the specific computing device 600 may be implemented as a general computer 620 such as a laptop or desktop computer. Alternatively, a component from the specific computing device 600 may be combined with another component in a specific mobile computing device 650. In one or more embodiments, an entire system may be made up of a plurality of specific computing devices 600 and/or a plurality of specific computing devices 600 coupled to a plurality of specific mobile computing devices 650.

In one embodiment, the specific mobile computing device 650 may include a mobile compatible processor 652, a mobile compatible memory 654, and an input/output device such as a mobile display 666, a communication interface 672, and a transceiver 658, among other components. The specific mobile computing device 650 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. In one embodiment, the components indicated heretofore are inter-coupled using various buses, and several of the components may be mounted on a common motherboard.

The mobile compatible processor 652 may execute instructions in the specific mobile computing device 650, including instructions stored in the mobile compatible memory 654. The mobile compatible processor 652 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The mobile compatible processor 652 may provide, for example, for coordination of the other components of the specific mobile computing device 650, such as control of user interfaces, applications run by the specific mobile computing device 650, and wireless communication by the specific mobile computing device 650.

The mobile compatible processor 652 may communicate with a user through the control interface 656 and the display interface 664 coupled to a mobile display 666. In one embodiment, the mobile display 666 may be a Thin-Film-Transistor Liquid Crystal Display (“TFT LCD”), an Organic Light Emitting Diode (“OLED”) display, and/or another appropriate display technology. The display interface 664 may comprise appropriate circuitry for driving the mobile display 666 to present graphical and other information to a user. The control interface 656 may receive commands from a user and convert them for submission to the mobile compatible processor 652.

In addition, an external interface 662 may be provided in communication with the mobile compatible processor 652, so as to enable near area communication of the specific mobile computing device 650 with other devices. External interface 662 may provide, for example, for wired communication in some embodiments, or for wireless communication in other embodiments, and multiple interfaces may also be used.

The mobile compatible memory 654 may be coupled to the specific mobile computing device 650. The mobile compatible memory 654 may be implemented as a volatile memory and a non-volatile memory. The expansion memory 678 may also be coupled to the specific mobile computing device 650 through the expansion interface 676, which may comprise, for example, a Single In Line Memory Module (“SIMM”) card interface. The expansion memory 678 may provide extra storage space for the specific mobile computing device 650, or may also store an application or other information for the specific mobile computing device 650.

Specifically, the expansion memory 678 may comprise instructions to carry out the processes described above. The expansion memory 678 may also comprise secure information. For example, the expansion memory 678 may be provided as a security module for the specific mobile computing device 650, and may be programmed with instructions that permit secure use of the specific mobile computing device 650. In addition, a secure application may be provided on the SIMM card, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.

The mobile compatible memory 654 may include a volatile memory and a non-volatile memory (e.g., a flash memory, a non-volatile random-access memory (“NVRAM”)). In one embodiment, a computer program comprises a set of instructions that, when executed, perform one or more methods. The set of instructions may be stored on the mobile compatible memory 654, the expansion memory 678, a memory coupled to the mobile compatible processor 652, and/or a propagated signal that may be received, for example, over the transceiver 658 and/or the external interface 662.

The specific mobile computing device 650 may communicate wirelessly through the communication interface 672, which may comprise digital signal processing circuitry. The communication interface 672 may provide for communications using various modes and/or protocols, such as a Global System for Mobile Communications (“GSM”) protocol, a Short Message Service (“SMS”) protocol, an Enhanced Messaging System (“EMS”) protocol, a Multimedia Messaging Service (“MMS”) protocol, a Code Division Multiple Access (“CDMA”) protocol, a Time Division Multiple Access (“TDMA”) protocol, a Personal Digital Cellular (“PDC”) protocol, a Wideband Code Division Multiple Access (“WCDMA”) protocol, a CDMA2000 protocol, and a General Packet Radio Service (“GPRS”) protocol.

Such communication may occur, for example, through the transceiver 658 (e.g., radio-frequency transceiver). In addition, short-range communication may occur, such as using a Bluetooth®, Wi-Fi, and/or other such transceiver. In addition, a GPS (“Global Positioning System”) receiver module 674 may provide additional navigation-related and location-related wireless data to the specific mobile computing device 650, which may be used as appropriate by a software application running on the specific mobile computing device 650.

The specific mobile computing device 650 may also communicate audiblyusing an audio codec 660, which may receive spoken information from auser and convert it to usable digital information. The audio codec 660may likewise generate audible sound for a user, such as through aspeaker (e.g., in a handset smartphone of the specific mobile computingdevice 650). Such a sound may comprise a sound from a voice telephonecall, a recorded sound (e.g., a voice message, a music files, etc.) andmay also include a sound generated by an application operating on thespecific mobile computing device 650.

The specific mobile computing device 650 may be implemented in a number of different forms, as shown in the figure. In one embodiment, the specific mobile computing device 650 may be implemented as a smartphone 668. In another embodiment, the specific mobile computing device 650 may be implemented as a personal digital assistant (“PDA”). In yet another embodiment, the specific mobile computing device 650 may be implemented as a tablet device 670.

Various embodiments of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed application-specific integrated circuits (“ASICs”), a piece of computer hardware, firmware, a software application, and a combination thereof. These various embodiments can include embodiment in one or more computer programs that are executable and/or interpretable on a programmable system including one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, one input device, and at least one output device.

These computer programs (also known as programs, software, software applications, and/or code) comprise machine-readable instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and/or “computer-readable medium” refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, and/or Programmable Logic Devices (“PLDs”)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here may be implemented on a computing device having a display device (e.g., a cathode ray tube (“CRT”) and/or liquid crystal display (“LCD”) monitor) for displaying information to the user and a keyboard and a mouse by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, and/or tactile feedback) and input from the user can be received in any form, including acoustic, speech, and/or tactile input.

The systems and techniques described here may be implemented in a computing system that includes a back end component (e.g., as a data server), a middleware component (e.g., an application server), a front end component (e.g., a client computer having a graphical user interface, and/or a Web browser through which a user can interact with an embodiment of the systems and techniques described here), and a combination thereof. The components of the system may also be coupled through a communication network.

The communication network may include a local area network (“LAN”) and a wide area network (“WAN”) (e.g., the Internet). The computing system can include a client and a server. In one embodiment, the client and the server are remote from each other and interact through the communication network.

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the claimed invention. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.

It may be appreciated that the various systems, methods, and apparatus disclosed herein may be embodied in a machine-readable medium and/or a machine accessible medium compatible with a data processing system (e.g., a computer system), and/or may be performed in any order.

The structures and modules in the figures may be shown as distinct and communicating with only a few specific structures and not others. The structures may be merged with each other, may perform overlapping functions, and may communicate with other structures not shown to be connected in the figures. Accordingly, the specification and/or drawings may be regarded in an illustrative rather than a restrictive sense.

Where the above examples, embodiments and implementations reference examples, it should be understood by those of ordinary skill in the art that other document ingestion systems and methods could be intermixed or substituted with those provided. In places where the description above refers to particular embodiments of format-agnostic document ingestion systems and methods, it should be readily apparent that a number of modifications may be made without departing from the spirit thereof and that these embodiments and implementations may be applied to other document digitization systems and methods as well. Accordingly, the disclosed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the disclosure and the knowledge of one of ordinary skill in the art.
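
By way of a non-limiting illustration, the label-and-direction detection flow described in this disclosure (matching a text element against an identifier, following a direction to a neighboring text element, and checking that neighbor against the validation criteria) might be sketched in Python as follows. The names TextElement, DataDetector, neighbor, and ingest, the alignment tolerance, and the sample invoice data are hypothetical and chosen only for this sketch; they are assumptions, not part of the disclosed embodiments.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TextElement:
    content: str
    x: float  # absolute horizontal position (assumed: left edge)
    y: float  # absolute vertical position (assumed: grows downward)


@dataclass
class DataDetector:
    data_type: str
    identifiers: List[str]            # potential labels
    directions: List[str]             # "right" or "below" (assumed vocabulary)
    is_valid: Callable[[str], bool]   # validation criteria


def neighbor(label: TextElement, elements: List[TextElement],
             direction: str, tolerance: float = 5.0) -> Optional[TextElement]:
    """Return the nearest element in the given direction, or None."""
    if direction == "right":
        found = [e for e in elements
                 if e.x > label.x and abs(e.y - label.y) <= tolerance]
        return min(found, key=lambda e: e.x - label.x, default=None)
    if direction == "below":
        found = [e for e in elements
                 if e.y > label.y and abs(e.x - label.x) <= tolerance]
        return min(found, key=lambda e: e.y - label.y, default=None)
    raise ValueError(f"unknown direction: {direction}")


def ingest(elements: List[TextElement],
           detectors: List[DataDetector]) -> Dict[str, str]:
    """Associate the first validated neighbor with each detector's data type."""
    results: Dict[str, str] = {}
    for det in detectors:
        labels = {i.lower() for i in det.identifiers}
        for el in elements:
            # Potential descriptor: content matches one of the identifiers.
            if el.content.strip().rstrip(":").lower() not in labels:
                continue
            for direction in det.directions:
                target = neighbor(el, elements, direction)
                if target is not None and det.is_valid(target.content):
                    results[det.data_type] = target.content
                    break
            if det.data_type in results:
                break
    return results


# Hypothetical OCR output for a small invoice.
elements = [
    TextElement("Invoice No:", 10, 10),
    TextElement("INV-1042", 120, 11),
    TextElement("Total", 10, 40),
    TextElement("Thank you", 120, 40),
    TextElement("$87.50", 12, 60),
]
detectors = [
    DataDetector("invoice_number", ["Invoice No", "Invoice #"],
                 ["right", "below"],
                 lambda s: re.fullmatch(r"INV-\d+", s) is not None),
    DataDetector("total", ["Total", "Amount Due"],
                 ["right", "below"],
                 lambda s: re.fullmatch(r"\$\d+\.\d{2}", s) is not None),
]
result = ingest(elements, detectors)
print(result)
```

In this sketch, the element to the right of "Total" fails validation, so the second listed direction is tried and the amount below the label is associated instead. This shows why a detector may carry more than one direction.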

1-20. (canceled)
21. A system for format-agnostic document ingestion, comprising: a document ingestion server comprising a processor and a memory, the document ingestion server communicatively coupled to a database, the processor configured to: receive an image of a document, the document comprising text arranged in an unknown format; convert, using optical character recognition, the image of the document into a plurality of text elements, each text element comprising a content and an absolute position within the document; retrieve a plurality of data detectors from the database, each data detector associated with a data type that is anticipated to be in the document, each data detector comprising at least one identifier that is one of a potential label and a potential format, and at least one validation criteria; identify a potential descriptor by comparing the content of each text element with the at least one identifier of at least one data detector; associate the text element meeting the validation criteria of the at least one data detector with the at least one data detector; store, for each text element associated with one data detector of the plurality of data detectors, the content of the text element, in the database; identify a table within the document by calculating for each text element of the plurality of text elements a relative position of at least one neighboring text element relative to the text element using the absolute position of the text element, and comparing the relative positions of the plurality of text elements; locate a header for the table by comparing the content of the text elements within the table with the identifiers of the plurality of data detectors and then identifying the data type of the matching text elements, the header being one of a row and a column; validate, for each identified text element in the header, at least one text element within the other of a row and a column described by the identified text element in the header with the validation criteria of the data detector that identified the identified text element in the header; associate, for each identified text element in the header, at least one validated text element within the other of the row and the column described by the identified text element in the header with the data detector that identified the identified text element in the header; and identify a potential descriptor by comparing the content of each text element not part of the table with the at least one identifier of at least one data detector.
22. The system of claim 21, wherein each text element further comprises a size, and wherein the potential format of each data detector further comprises a potential size.
23. The system of claim 21, wherein each validation criteria describes at least one of a valid format and a valid range.
24. The system of claim 21: wherein each data detector further comprises at least one direction describing a potential relative direction of a text element having a label associated with the data detector; and wherein the processor is further configured to: determine a source of the document by comparing the at least one identifier of a data detector associated with a data type that is unique among potential document sources with the content of each text element of the plurality of text elements; and for each data detector, order at least one of the identifiers and the directions according to a history stored in the database and associated with the source; and update, for each data detector, the history associated with the source, according to which identifier of the at least one identifier and which direction of the at least one direction matched the most text elements of the data type described by the data detector in the document.
25. The system of claim 24, wherein determining the source of the document comprises identifying all postal addresses in the document based upon an observed format, validating each postal address, placing each postal address in a standard format, and comparing each address with a list of addresses unique to each of a plurality of known document sources.
26. The system of claim 21, wherein the processor is further configured to: train a machine learning model correlating text elements with the data detectors they have been associated with; determine whether the machine learning model performs better than one or more data detectors; and automatically employ the machine learning model in place of the one or more data detectors once the machine learning model outperforms the one or more data detectors.
27. A system for format-agnostic document ingestion, comprising: a document ingestion server comprising a processor and a memory, the document ingestion server communicatively coupled to a database, the processor configured to: receive an image of a document, the document comprising text arranged in an unknown format; convert, using optical character recognition, the image of the document into a plurality of text elements, each text element comprising a content and an absolute position within the document; retrieve a plurality of data detectors from the database, each data detector associated with a data type that is anticipated to be in the document, each data detector comprising at least one identifier that is a potential label, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria; identify a potential descriptor by comparing the content of each text element with the at least one identifier of at least one data detector; determine if the text element pointed to by one of the at least one direction of the data detector used to identify the potential descriptor meets the validation criteria of the data detector; associate the validated text element with the data detector; and store, for each text element associated with one data detector of the plurality of data detectors, the content of the text element, in the database.
28. The system of claim 27, wherein the processor is further configured to: identify a table within the document by calculating for each text element of the plurality of text elements a relative position of at least one neighboring text element relative to the text element using the absolute position of the text element, and comparing the relative positions of the plurality of text elements; locate a header for the table by comparing the content of the text elements within the table with the identifiers of the plurality of data detectors and then identifying the data type of the matching text elements, the header being one of a row and a column; validate, for each identified text element in the header, at least one text element within the other of a row and a column described by the identified text element in the header with the validation criteria of the data detector that identified the identified text element in the header; and associate, for each identified text element in the header, at least one validated text element within the other of the row and the column described by the identified text element in the header with the data detector that identified the identified text element in the header.
29. The system of claim 27: wherein the processor is further configured to identify a document type by searching the content of each text element for a plurality of distinguishing strings, each distinguishing string being unique to one document type; and wherein the plurality of data detectors retrieved from the database is selected based on the document type.
30. The system of claim 27, wherein each identifier is at least one of a potential label and a potential format, and wherein each validation criteria describes a valid format.
31. The system of claim 30, wherein each text element further comprises a size, and wherein the potential format of each data detector further comprises a potential size.
32. The system of claim 27, wherein each validation criteria describes at least one of a valid format and a valid range.
33. A method for format-agnostic document ingestion, comprising: receiving, by a processor, an image of a document, the document comprising text arranged in an unknown format; converting, using optical character recognition performed by the processor, the image of the document into a plurality of text elements, each text element comprising a content and an absolute position within the document; retrieving a plurality of data detectors, each data detector associated with a data type that is anticipated to be in the document, each data detector comprising at least one identifier that is a potential label, at least one direction describing a potential relative direction of a text element having a label associated with the data detector, and at least one validation criteria; identifying a potential descriptor by comparing the content of each text element with the at least one identifier of at least one data detector; determining if the text element pointed to by one of the at least one direction of the data detector used to identify the potential descriptor meets the validation criteria of the data detector; associating the validated text element with the data detector; and storing, for each text element associated with one data detector of the plurality of data detectors, the content of the text element.
34. The method of claim 33, further comprising: identifying a document type by searching the content of each text element for a plurality of distinguishing strings, each distinguishing string being unique to one document type; wherein the plurality of data detectors retrieved is selected based on the document type.
35. The method of claim 33, wherein each identifier is one of a potential label and a potential format, and wherein each validation criteria describes a valid format.
36. The method of claim 35, wherein each text element further comprises a size, and wherein the potential format of each data detector further comprises a potential size.
37. The method of claim 33, wherein each validation criteria describes one of a valid format and a valid range.
38. The method of claim 33, further comprising: determining a source of the document by comparing the at least one identifier of a data detector associated with a data type that is unique among potential document sources with the content of each text element of the plurality of text elements; and ordering, for each data detector, at least one of the identifiers and the directions according to a history associated with the source; and updating, for each data detector, the history associated with the source, according to which identifier of the at least one identifier and which direction of the at least one direction matched the most text elements of the data type described by the data detector in the document.
39. The method of claim 38, wherein determining the source of the document further comprises identifying all postal addresses in the document based upon an observed format, validating each postal address, placing each postal address in a standard format, and comparing each address with a list of addresses unique to each of a plurality of known document sources.
40. The method of claim 33, further comprising: training a machine learning model correlating text elements with the data detectors they have been associated with; determining whether the machine learning model performs better than one or more data detectors; and automatically employing the machine learning model in place of the one or more data detectors once the machine learning model outperforms the one or more data detectors.